AITopics | mechanistic study

Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

Sinii, Viacheslav, Balagansky, Nikita, Gerasimov, Gleb, Laptev, Daniil, Aksenov, Yaroslav, Kurochkin, Vadim, Gorbatovski, Alexey, Shaposhnikov, Boris, Gavrilov, Daniil

arXiv.org Artificial Intelligence | Oct-2-2025

The mechanisms by which reasoning training reshapes LLMs' internal computations remain unclear. We study lightweight steering vectors inserted into the base model's residual stream and trained with a reinforcement-learning objective. These vectors match full fine-tuning performance while preserving the interpretability of small, additive interventions. Using logit-lens readouts and path-patching analyses on two models, we find that (i) the last-layer steering vector acts like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as "To" and "Step"; (ii) the penultimate-layer vector leaves attention patterns largely intact and instead operates through the MLP and unembedding, preferentially up-weighting process words and structure symbols; and (iii) middle layers de-emphasize non-English tokens. Next, we show that a sparse autoencoder (SAE) isolates features associated with correct generations. We also show that steering vectors (i) transfer to other models, (ii) combine across layers when trained in isolation, and (iii) concentrate magnitude on meaningful prompt segments under adaptive token-wise scaling. Taken together, these results deepen understanding of how trained steering vectors shape computation and should inform future work in activation engineering and the study of reasoning models.
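
As a rough illustration of the additive intervention and logit-lens readout described in this abstract, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the three-dimensional hidden state, the toy unembedding table `UNEMBED`, and the helper names `add_steering`, `logit_lens`, and `top_token` are hypothetical and are not taken from the paper. In the paper the vector is trained with a reinforcement-learning objective; here it is hard-coded.

```python
from typing import Dict, List

Vector = List[float]


def add_steering(hidden: Vector, steer: Vector, scale: float = 1.0) -> Vector:
    """Additive intervention: h' = h + scale * v (element-wise)."""
    return [h + scale * s for h, s in zip(hidden, steer)]


def logit_lens(hidden: Vector, unembed: Dict[str, Vector]) -> Dict[str, float]:
    """Read a hidden state out as token logits via dot products with unembedding rows."""
    return {tok: sum(h * w for h, w in zip(hidden, row)) for tok, row in unembed.items()}


def top_token(logits: Dict[str, float]) -> str:
    """Return the token with the highest logit."""
    return max(logits, key=logits.get)


# Toy unembedding table (hypothetical values, one row per token).
UNEMBED = {
    "The": [1.0, 0.0, 0.0],
    "To": [0.0, 1.0, 0.0],
    "Step": [0.0, 0.0, 1.0],
}

hidden = [2.0, 0.5, 0.0]  # toy last-layer state at the first generated position
steer = [0.0, 2.0, 1.0]   # toy steering vector (hard-coded here, not trained)

base = logit_lens(hidden, UNEMBED)
steered = logit_lens(add_steering(hidden, steer), UNEMBED)

print(top_token(base), "->", top_token(steered))
```

In this toy setup the vector shifts the top first-token prediction from "The" to "To", which is the kind of token-substitution bias the abstract attributes to the last-layer vector.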

large language model, machine learning, natural language, (17 more...)

2509.06608

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Iteration Head: A Mechanistic Study of Chain-of-Thought

Neural Information Processing Systems | May-27-2025, 15:47:26 GMT

Chain-of-Thought (CoT) reasoning is known to improve Large Language Models both empirically and in terms of theoretical approximation power. However, our understanding of the inner workings of CoT capabilities, and of the conditions under which they appear, remains limited. This paper helps fill this gap by demonstrating how CoT reasoning emerges in transformers in a controlled and interpretable setting. In particular, we observe the appearance of a specialized attention mechanism dedicated to iterative reasoning, which we call "iteration heads". We track both the emergence and the precise working of these iteration heads down to the attention level, and measure how well the CoT skills they give rise to transfer between tasks.

chain-of-thought, iteration head, mechanistic study, (1 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.31)

What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

Neural Information Processing Systems | May-27-2025, 12:04:34 GMT

Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by modeling the interaction between the task the model is asked to perform (e.g., "design") and the specific concepts the task is to be performed upon (e.g., a "cycle" vs. a "bomb"). Using this, we investigate three well-known safety fine-tuning methods (supervised safety fine-tuning, direct preference optimization, and unlearning) and provide significant evidence that these methods minimally transform MLP weights so that unsafe inputs are aligned with the null space of those weights. This yields a clustering of inputs based on whether the model deems them safe or not. Correspondingly, when an adversarial input (e.g., a jailbreak) is provided, its activations are closer to those of safer samples, leading the model to process such an input as if it were safe.
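
The null-space idea in this abstract can be illustrated with a small linear-algebra sketch. This is not the paper's procedure: it assumes a single toy weight matrix `W` and a single hypothetical "unsafe" direction `u`, and it uses a standard rank-one update, W' = W - (W u) uᵀ / (uᵀ u), which is the smallest change (in Frobenius norm) that sends `u` to zero. The names `matvec` and `project_out` are made up for this example.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def project_out(W: Matrix, u: Vector) -> Matrix:
    """Rank-one update W' = W - (W u) u^T / (u^T u), so that W' u = 0."""
    norm_sq = sum(ui * ui for ui in u)
    Wu = matvec(W, u)
    return [
        [w - wu_i * uj / norm_sq for w, uj in zip(row, u)]
        for row, wu_i in zip(W, Wu)
    ]


W = [[1.0, 2.0], [3.0, 4.0]]  # toy MLP weight matrix (hypothetical values)
unsafe = [1.0, 0.0]           # hypothetical "unsafe" input direction
safe = [0.0, 1.0]             # an input orthogonal to the unsafe direction

W_tuned = project_out(W, unsafe)

print(matvec(W_tuned, unsafe))                 # unsafe direction now maps to zero
print(matvec(W, safe), matvec(W_tuned, safe))  # orthogonal input is unaffected
```

The update leaves inputs orthogonal to `u` untouched, which is one simple way to picture a "minimal" weight change that separates the two kinds of input.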

make and break safety fine-tuning, mechanistic study, unsafe input

Country: Europe > Latvia > Lubāna Municipality > Lubāna (0.09)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.63)

Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Golkar, Siavash, Bietti, Alberto, Pettee, Mariel, Eickenberg, Michael, Cranmer, Miles, Hirashima, Keiya, Krawezik, Geraud, Lourie, Nicholas, McCabe, Michael, Morel, Rudy, Ohana, Ruben, Parker, Liam Holden, Blancard, Bruno Régaldo-Saint, Cho, Kyunghyun, Ho, Shirley

arXiv.org Machine Learning | May-30-2024

Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that using no positional embeddings leads to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out-of-distribution performance is tightly linked to which tokens the model uses as a bias term.
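
The abstract does not spell out the task format, so the following is only a guess at what a "localize, then count" reference solution might look like. It assumes a sequence of "0" and "1" tokens in which regions are marked by "[" and "]" delimiters, and that the target is the number of "1" tokens inside each region. The function name `count_ones_per_region` and the token vocabulary are assumptions for illustration.

```python
from typing import List, Optional


def count_ones_per_region(
    tokens: List[str], open_tok: str = "[", close_tok: str = "]"
) -> List[int]:
    """Count "1" tokens inside each delimited region, in order of appearance."""
    counts: List[int] = []
    current: Optional[int] = None  # None means we are outside any region
    for tok in tokens:
        if tok == open_tok:
            current = 0                 # localization: a region starts here
        elif tok == close_tok:
            if current is not None:
                counts.append(current)  # region ends, record its count
            current = None
        elif current is not None and tok == "1":
            current += 1                # computation: count within the region
    return counts


example = list("0[101]1[11]0[0]")
print(count_ones_per_region(example))
```

Tokens outside the delimiters are ignored, so a model solving this task has to first find the region boundaries and only then aggregate within them.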

contextual, sequence, transformer, (15 more...)

2406.02585

Country:

North America > United States > New York (0.04)
North America > United States > Colorado > Boulder County > Boulder (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (0.88)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
